Wisteria: Nurturing Scalable Data Cleaning Infrastructure

Authors

  • Daniel Haas
  • Sanjay Krishnan
  • Jiannan Wang
  • Michael J. Franklin
  • Eugene Wu
Abstract

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine analysis with more sophisticated/expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to a full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations, and driven by analyst feedback, suggests optimizations and/or replacements to the analyst’s choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We overview the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at: http://www.sampleclean.org.
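
The separation of logical operations from physical implementations described in the abstract can be illustrated with a small sketch. This is not Wisteria's API: every name below (`LogicalStep`, `REGISTRY`, `dedup_exact`, `dedup_jaccard`, `run`) is hypothetical, and the two deduplication strategies are stand-ins chosen for illustration. The sketch assumes a workflow is a list of logical steps, each bound to one of several interchangeable physical operators, so the analyst can swap the implementation without rewriting the workflow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Records = List[str]
PhysicalOp = Callable[[Records], Records]


def dedup_exact(records: Records) -> Records:
    """Cheap physical operator: drop records identical after normalization."""
    seen = set()
    out = []
    for r in records:
        key = r.strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def _jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dedup_jaccard(records: Records, threshold: float = 0.5) -> Records:
    """Costlier physical operator: drop records too similar to one already kept."""
    kept: Records = []
    for r in records:
        if all(_jaccard(r, k) < threshold for k in kept):
            kept.append(r)
    return kept


@dataclass
class LogicalStep:
    name: str      # logical operation, e.g. "deduplicate"
    physical: str  # currently chosen physical implementation


# Hypothetical registry: logical operation -> available physical operators.
REGISTRY: Dict[str, Dict[str, PhysicalOp]] = {
    "deduplicate": {"exact": dedup_exact, "jaccard": dedup_jaccard},
}


def run(plan: List[LogicalStep], records: Records) -> Records:
    """Execute each logical step with its currently bound physical operator."""
    for step in plan:
        records = REGISTRY[step.name][step.physical](records)
    return records


sample = ["Acme Corp", "acme corp ", "Acme Corp Inc", "Globex"]
plan = [LogicalStep("deduplicate", "exact")]
first_pass = run(plan, sample)

# Swap the physical implementation; the logical plan itself is unchanged.
plan[0].physical = "jaccard"
second_pass = run(plan, sample)
```

In this toy setup the analyst would first run the cheap operator on a small sample, inspect the result, and then rebind the same logical step to a more aggressive operator. A real system would also have to weigh cost and accuracy when suggesting a replacement, which this sketch does not attempt.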

Similar resources

A Software Infrastructure for the CLEENEX Optimizer

Problems associated with data quality are a growing concern. Throughout this document we focus on a specific data quality problem: the existence of approximate duplicate records. Data cleaning aims at correcting data quality problems that can be found in various situations. There are some data cleaning tools that address these data quality problems. One of the tasks of a dat...

A New Framework for Increasing the Sustainability of Infrastructure Measurement of Smart Grid

Advanced Metering Infrastructure (AMI) is one of the most significant applications of the Smart Grid. It is used to measure, collect, and analyze data on power consumption. In the AMI network, smart meter traffic is aggregated at intermediate aggregators and forwarded to the Meter Data Management System (MDMS). The infrastructure used in this network should be reliable, real-time an...

Horticulture, hybrid cultivars and exotic plant invasion: a case study of Wisteria (Fabaceae)

Exotic Wisteria species are highly favoured for their horticultural qualities and have been cultivated in North America since the early 1800s. This study determines the identity, genetic diversity and hybrid status of 25 Asian Wisteria cultivars using plastid, mitochondrial and nuclear DNA data. Fifteen (60%) hybrid cultivars were identified. All of the ‘Wisteria sinensis’ cultivars sampled are...

Statistical Distortion: Consequences of Data Cleaning

We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvemen...

Bi-parental cytoplasmic DNA inheritance in Wisteria (Fabaceae): evidence from a natural experiment.

Cytoplasmic inheritance was investigated in interspecific hybrids of Wisteria sinensis and W. floribunda. Species-specific nuclear, mitochondrial and plastid DNA markers were identified from wild-collected plants of each species in its native range. These markers provide evidence for the bi-parental transmission of plastids in hybrid swarms of these two species in the southeastern USA. These po...

Journal:
  • PVLDB

Volume 8

Publication date: 2015